Add support for CUDA >= 12.9 #757
Conversation
Thanks a lot for sending this PR @Kh4L, this is very helpful.
I applied our linter and also enabled testing for CUDA 12.9 so that we can correctly check the PR.
Do I understand correctly that in 12.9 we have to use the context-based API, while at the same time the context creation helper was removed?! This sounds error-prone; is there any way we could avoid manually building and setting the context attributes?
@NicolasHug LMK if you need anything else! The assertion error doesn't seem related to my change.
Nothing else to do on your side @Kh4L, thank you. I'll merge this soon; I'll just try to extract all the …
Revert "…e. Leaving" (this reverts commit 7056fc0).
Let me try to summarize the changes and add context for other reviewers, and for future reference. From their release notes, CUDA 12.9 deprecates:

- the non-context NPP APIs (the variants without the `_Ctx` suffix)
- the `nppGetStreamContext` helper

Removing the non-context APIs makes sense to me. I don't understand the logic behind removing `nppGetStreamContext`: we now have to manually create the `NppStreamContext` ourselves.
```cpp
// NppStreamContext hStream and nStreamFlags should not be part of the cache
// because they may change across calls.
NppStreamContext nppCtx = createNppStreamContext(
    static_cast<int>(getFFMPEGCompatibleDeviceIndex(device_)));
```
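For reference, a `createNppStreamContext` helper along these lines can fill the documented `NppStreamContext` fields (see `nppdefs.h`) from the CUDA runtime. This is a sketch against the public NPP/CUDA APIs, not necessarily the PR's exact code:

```cpp
#include <cuda_runtime.h>
#include <nppdefs.h>

// Build an NppStreamContext by hand, since CUDA 12.9+ deprecates the
// nppGetStreamContext() helper. Field names follow nppdefs.h.
NppStreamContext createNppStreamContext(int deviceIndex) {
  NppStreamContext ctx{};
  ctx.nCudaDeviceId = deviceIndex;

  cudaDeviceGetAttribute(
      &ctx.nMultiProcessorCount, cudaDevAttrMultiProcessorCount, deviceIndex);
  cudaDeviceGetAttribute(
      &ctx.nMaxThreadsPerMultiProcessor,
      cudaDevAttrMaxThreadsPerMultiProcessor,
      deviceIndex);
  cudaDeviceGetAttribute(
      &ctx.nMaxThreadsPerBlock, cudaDevAttrMaxThreadsPerBlock, deviceIndex);

  // nSharedMemPerBlock is a size_t, but cudaDeviceGetAttribute writes an int.
  int sharedMemPerBlock = 0;
  cudaDeviceGetAttribute(
      &sharedMemPerBlock, cudaDevAttrMaxSharedMemoryPerBlock, deviceIndex);
  ctx.nSharedMemPerBlock = static_cast<size_t>(sharedMemPerBlock);

  cudaDeviceGetAttribute(
      &ctx.nCudaDevAttrComputeCapabilityMajor,
      cudaDevAttrComputeCapabilityMajor,
      deviceIndex);
  cudaDeviceGetAttribute(
      &ctx.nCudaDevAttrComputeCapabilityMinor,
      cudaDevAttrComputeCapabilityMinor,
      deviceIndex);

  // hStream and nStreamFlags are deliberately left for the caller to set
  // per call, since they can change between calls (see the comment above).
  return ctx;
}
```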
A note on the cache: I originally implemented the "cache" as a simple `nppCtx_` attribute on the `CudaDeviceInterface` class: 565896e (#757). But I don't think that would be correct: the `CudaDeviceInterface` instance is global, and we only have one single instance for all CUDA devices. And we can't use one single `NppStreamContext` for all CUDA devices; we need one `NppStreamContext` per device.
So we need a per-device cache for the `NppStreamContext`, similar to our existing `hw_device_ctx` cache. I'm leaving that for an immediate follow-up.
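For illustration only, a hypothetical per-device cache could look like the sketch below. The names `perDeviceNppCtx` and `getCachedNppStreamContext` are assumptions, not from the PR; it reuses the `createNppStreamContext` helper sketched above:

```cpp
#include <map>
#include <mutex>

// Hypothetical follow-up sketch: cache one NppStreamContext per device
// index, since the device interface instance is shared across all devices.
std::map<int, NppStreamContext> perDeviceNppCtx;
std::mutex nppCtxMutex;

NppStreamContext getCachedNppStreamContext(int deviceIndex) {
  std::lock_guard<std::mutex> lock(nppCtxMutex);
  auto it = perDeviceNppCtx.find(deviceIndex);
  if (it == perDeviceNppCtx.end()) {
    it = perDeviceNppCtx
             .emplace(deviceIndex, createNppStreamContext(deviceIndex))
             .first;
  }
  // Return a copy so the caller can set hStream/nStreamFlags per call
  // without mutating the cached entry.
  return it->second;
}
```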
```
@@ -265,37 +303,37 @@ void CudaDeviceInterface::convertAVFrameToFrameOutput(
  dst = allocateEmptyHWCTensor(height, width, device_);
}

// Use the user-requested GPU for running the NPP kernel.
c10::cuda::CUDAGuard deviceGuard(device_);
```
This guard isn't needed anymore, as we now explicitly pass the device to the `NppStreamContext` creation.
```cpp
at::cuda::CUDAStream nppStreamWrapper =
    c10::cuda::getStreamFromExternal(nppGetStream(), device_.index());
nppDoneEvent.record(nppStreamWrapper);
nppDoneEvent.block(at::cuda::getCurrentCUDAStream());
```
These syncs aren't needed anymore because we now explicitly ask NPP to rely on PyTorch's current stream.
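For illustration, here is a sketch of how the context can be pointed at PyTorch's current stream on each call. The helper name is an assumption; the `at::cuda` and CUDA runtime calls are real APIs:

```cpp
#include <ATen/cuda/CUDAContext.h>
#include <cuda_runtime.h>
#include <nppdefs.h>

// Point the NPP context at PyTorch's current stream for the target device,
// so NPP work is ordered with other PyTorch CUDA work on that stream and
// no extra event-based synchronization is needed.
void setNppStreamToCurrentTorchStream(
    NppStreamContext& nppCtx,
    c10::DeviceIndex deviceIndex) {
  nppCtx.hStream = at::cuda::getCurrentCUDAStream(deviceIndex).stream();
  unsigned int flags = 0;
  cudaStreamGetFlags(nppCtx.hStream, &flags);
  nppCtx.nStreamFlags = flags;
}
```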
Recent CUDA versions don't support non-context NPP calls, so use the context-based API calls. Also, CUDA 12.9+ deprecates `nppGetStreamContext`, so we need to build the NPP context manually.
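As an illustration of the context-based API, here is a sketch of an NV12-to-RGB conversion using a `_Ctx` variant. Buffers, steps, and ROI are placeholders, not the PR's code:

```cpp
#include <nppi_color_conversion.h>

// The _Ctx-suffixed NPP variants take an explicit NppStreamContext instead
// of relying on NPP's implicit global stream.
NppStatus convertNV12ToRGB(
    const Npp8u* const src[2], // NV12: luma plane + interleaved chroma plane
    int srcStep,
    Npp8u* dst,
    int dstStep,
    NppiSize roi,
    NppStreamContext nppCtx) {
  // Old, non-context form (deprecated in recent CUDA):
  //   nppiNV12ToRGB_8u_P2C3R(src, srcStep, dst, dstStep, roi);
  return nppiNV12ToRGB_8u_P2C3R_Ctx(src, srcStep, dst, dstStep, roi, nppCtx);
}
```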